The aim of this Jupyter notebook is to provide a full explanation of a sentiment analysis performed on the "food.com" dataset. Since the dataset provides user ratings, it is possible to train a supervised model. The notebook also reports the results of different algorithms and models applied to the same dataset and compares them with some state-of-the-art models.
Sentiment Analysis, or Opinion Mining, is a sub-field of Natural Language Processing (NLP) that tries to identify and extract opinions within a given text. The aim of sentiment analysis is to gauge the attitude, sentiments, evaluations and emotions of a speaker or writer based on the computational treatment of subjectivity in a text.
import nltk
import string
import fasttext
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.tools as tls
import plotly.express as px
import plotly.offline as py
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
from collections import Counter
from nltk.corpus import stopwords
from sklearn.svm import LinearSVC
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib inline
color = sns.color_palette()
py.init_notebook_mode(connected=True)
The provided dataset is available at:
This dataset consists of 180K+ recipes and 700K+ recipe reviews covering 18 years of user interactions and uploads on Food.com (formerly GeniusKitchen).
The dataset provides two files, called RAW_recipes and RAW_interactions, that represent respectively the recipes, along with all their information, and the reviews provided by different users. Since the aim of this project is a sentiment analysis based on the reviews, only the RAW_interactions dataset is needed.
recipes_path = '../provided_datasets/food.com/RAW_recipes.csv'
reviews_path = '../provided_datasets/food.com/RAW_interactions.csv'
The analyzed dataset is the one that contains all the review information. As shown in the table below, a review is composed of:
reviews_df = pd.read_csv(reviews_path)
reviews_df.head()
| user_id | recipe_id | date | rating | review | |
|---|---|---|---|---|---|
| 0 | 38094 | 40893 | 2003-02-17 | 4 | Great with a salad. Cooked on top of stove for... |
| 1 | 1293707 | 40893 | 2011-12-21 | 5 | So simple, so delicious! Great for chilly fall... |
| 2 | 8937 | 44394 | 2002-12-01 | 4 | This worked very well and is EASY. I used not... |
| 3 | 126440 | 85009 | 2010-02-27 | 5 | I made the Mexican topping and took it to bunk... |
| 4 | 57222 | 85009 | 2011-10-01 | 5 | Made the cheddar bacon topping, adding a sprin... |
From the information below it is possible to get a clear overview of the complete dataset: it contains 1132367 reviews, written by 226570 unique users and covering 231637 unique recipes.
print('Number of reviews in the dataset:', reviews_df.shape[0])
print('Number of unique users reviews:', len(reviews_df.groupby('user_id')['rating'].count().index))
print('Number of unique recipe reviews:', len(reviews_df.groupby('recipe_id')['rating'].count().index))
print('Number of NaN values in the dataset:', reviews_df.isna().sum().sum())
Number of reviews in the dataset: 1132367 Number of unique users reviews: 226570 Number of unique recipe reviews: 231637 Number of NaN values in the dataset: 169
The dataset also contains null values (169). It is necessary to investigate in which of the previously described columns the NaN values are present. From the following information it is possible to say that all the NaN values are contained in the review column. Since this column is essential for the scope of this project, the reviews without a description are dropped.
reviews_df.isna().sum()
user_id 0 recipe_id 0 date 0 rating 0 review 169 dtype: int64
reviews_df = reviews_df.dropna()
reviews_df.isna().sum()
user_id 0 recipe_id 0 date 0 rating 0 review 0 dtype: int64
Then, it is possible to compute the distribution of the reviews based on the number of words: the following graph shows that most reviews contain around 30 words (31 words for 17196 reviews, 33 words for 17124 reviews and 30 words for 17043 reviews).
reviews_count_df = reviews_df.copy()
reviews_count_df['# words'] = reviews_count_df.apply(lambda x: len(x['review'].split()), axis=1)
reviews_count_df.head()
| user_id | recipe_id | date | rating | review | # words | |
|---|---|---|---|---|---|---|
| 0 | 38094 | 40893 | 2003-02-17 | 4 | Great with a salad. Cooked on top of stove for... | 27 |
| 1 | 1293707 | 40893 | 2011-12-21 | 5 | So simple, so delicious! Great for chilly fall... | 31 |
| 2 | 8937 | 44394 | 2002-12-01 | 4 | This worked very well and is EASY. I used not... | 19 |
| 3 | 126440 | 85009 | 2010-02-27 | 5 | I made the Mexican topping and took it to bunk... | 13 |
| 4 | 57222 | 85009 | 2011-10-01 | 5 | Made the cheddar bacon topping, adding a sprin... | 12 |
fig = px.histogram(reviews_count_df, x='# words')
fig.update_layout(title_text='Number of words in reviews')
fig.show()
To simplify the process, it is possible to save the mentioned dataset in a csv file, in order to be subsequently loaded by the model.
#reviews_df.to_csv('../exported_datasets/food.com/reviews.csv', index=False, header=True)
review_path = '../exported_datasets/food.com/reviews.csv'
review_df = pd.read_csv(review_path)
review_df.head()
| user_id | recipe_id | date | rating | review | |
|---|---|---|---|---|---|
| 0 | 38094 | 40893 | 2003-02-17 | 4 | Great with a salad. Cooked on top of stove for... |
| 1 | 1293707 | 40893 | 2011-12-21 | 5 | So simple, so delicious! Great for chilly fall... |
| 2 | 8937 | 44394 | 2002-12-01 | 4 | This worked very well and is EASY. I used not... |
| 3 | 126440 | 85009 | 2010-02-27 | 5 | I made the Mexican topping and took it to bunk... |
| 4 | 57222 | 85009 | 2011-10-01 | 5 | Made the cheddar bacon topping, adding a sprin... |
From the following graph it is possible to analyse the rating distribution of the user reviews. It shows that most of the customer ratings are positive. Hence, it is reasonable to expect that most reviews will also be fairly positive.
fig = px.histogram(review_df, x='rating')
fig.update_layout(title_text='Recipe Ratings')
fig.show()
In this step, the classification of the reviews into positive or negative is provided. From the following classification it is possible to generate the training set for our model.
Positive reviews will be classified as +1 and negative reviews as -1. Accordingly, all reviews with "rating" > 3 will be classified as +1, indicating that they are positive, and all reviews with "rating" < 3 as -1. Reviews with "rating" = 3 are dropped because they are neutral.
review_df = review_df[review_df['rating'] != 3]
review_df['sentiment'] = review_df['rating'].apply(lambda x: 1 if x > 3 else -1)
review_df.head()
| user_id | recipe_id | date | rating | review | sentiment | |
|---|---|---|---|---|---|---|
| 0 | 38094 | 40893 | 2003-02-17 | 4 | Great with a salad. Cooked on top of stove for... | 1 |
| 1 | 1293707 | 40893 | 2011-12-21 | 5 | So simple, so delicious! Great for chilly fall... | 1 |
| 2 | 8937 | 44394 | 2002-12-01 | 4 | This worked very well and is EASY. I used not... | 1 |
| 3 | 126440 | 85009 | 2010-02-27 | 5 | I made the Mexican topping and took it to bunk... | 1 |
| 4 | 57222 | 85009 | 2011-10-01 | 5 | Made the cheddar bacon topping, adding a sprin... | 1 |
Once all the reviews are classified with a sentiment (+1 for positive reviews, -1 for negative), it is possible to explore and analyse the classified data. First of all, the data are divided into two different dataframes: one with all the positive reviews and another with all the negative ones.
After removing the neutral reviews, from the information below it is possible to say that:
positive_reviews_df = review_df[review_df['sentiment'] == 1]
negative_reviews_df = review_df[review_df['sentiment'] == -1]
print('Number of positive reviews:', positive_reviews_df.shape[0])
print('Number of negative reviews:', negative_reviews_df.shape[0])
Number of positive reviews: 1003562 Number of negative reviews: 87784
From the following histogram it is possible to see the distributions of reviews with sentiment across the dataset.
fig = px.histogram(review_df, x='sentiment')
fig.update_traces(marker_color="indianred")
fig.update_layout(title_text='Food Review Sentiment')
fig.show()
In order to feed the model, the data need to be cleaned and preprocessed: first all punctuation is removed from the text, then all the stopwords.
nltk.download('stopwords')
nltk.download('punkt')  # required by nltk.word_tokenize below
stop_words = set(stopwords.words('english'))
[nltk_data] Downloading package stopwords to [nltk_data] C:\Users\Clark\AppData\Roaming\nltk_data... [nltk_data] Package stopwords is already up-to-date!
def preprocess_text(text):
tokens = nltk.word_tokenize(text)
tokens = [w.lower() for w in tokens]
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
words = [word for word in stripped if word.isalpha()]
words = [w for w in words if not w in stop_words]
return ' '.join(words)
clean_review_df = review_df.copy()
clean_review_df['review'] = [preprocess_text(x) for x in clean_review_df['review']]
clean_review_df.head()
| user_id | recipe_id | date | rating | review | sentiment | |
|---|---|---|---|---|---|---|
| 0 | 38094 | 40893 | 2003-02-17 | 4 | great salad cooked top stove minutesadded shak... | 1 |
| 1 | 1293707 | 40893 | 2011-12-21 | 5 | simple delicious great chilly fall evening dou... | 1 |
| 2 | 8937 | 44394 | 2002-12-01 | 4 | worked well easy used quite whole package whit... | 1 |
| 3 | 126440 | 85009 | 2010-02-27 | 5 | made mexican topping took bunko everyone loved | 1 |
| 4 | 57222 | 85009 | 2011-10-01 | 5 | made cheddar bacon topping adding sprinkling b... | 1 |
clean_review_df = clean_review_df.drop(columns=['user_id', 'recipe_id', 'date', 'rating'])
clean_review_df.head()
| review | sentiment | |
|---|---|---|
| 0 | great salad cooked top stove minutesadded shak... | 1 |
| 1 | simple delicious great chilly fall evening dou... | 1 |
| 2 | worked well easy used quite whole package whit... | 1 |
| 3 | made mexican topping took bunko everyone loved | 1 |
| 4 | made cheddar bacon topping adding sprinkling b... | 1 |
It is now possible to build the sentiment model. The model takes a review as input and predicts whether it is positive or negative. Since this is a classification problem, it is possible to start with a simple logistic regression model.
Logistic regression is similar to linear regression, but is used when the dependent variable is not a number but something else (e.g., a "yes/no" response). Although it is called regression, it performs classification: it models the probability of each class and assigns the dependent variable to one of the classes.
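As a minimal illustration (toy values, not part of the dataset), the logistic function maps any real-valued score produced by the linear part of the model into a probability between 0 and 1:

```python
import numpy as np

def sigmoid(z):
    # squash a real-valued score into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

# a strongly negative score maps near 0, zero maps to 0.5,
# and a strongly positive score maps near 1
for z in (-4.0, 0.0, 4.0):
    print(z, round(float(sigmoid(z)), 3))
```

The model then assigns the class whose predicted probability exceeds 0.5.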
Then the dataset is split into two dataframes: 80% of the data for the training set and the remaining 20% for the test set.
train, test = train_test_split(clean_review_df, test_size=0.2)
print('Size of train data', train.shape)
print('Size of test data', test.shape)
Size of train data (873076, 2) Size of test data (218270, 2)
Since logistic regression cannot work with raw text, it is necessary to convert all the text values into a bag-of-words representation. Hence, the dataframe is transformed into a bag-of-words model, i.e. a sparse matrix of integer word counts.
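As a quick sketch of what this produces, here is CountVectorizer applied to a toy two-document corpus (illustrative sentences, not from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ['great easy recipe', 'easy easy dinner']
toy_vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each word to its column index (alphabetical order)
print(sorted(toy_vectorizer.vocabulary_))  # ['dinner', 'easy', 'great', 'recipe']
# each row counts how often each vocabulary word occurs in a document
print(toy_matrix.toarray())                # [[0 1 1 1]
                                           #  [1 2 0 0]]
```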
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train['review'])
test_matrix = vectorizer.transform(test['review'])
x_train = train_matrix
x_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']
print('X train shape:', x_train.shape)
print('X test shape:', x_test.shape)
print('Y train shape:', y_train.shape)
print('Y test shape:', y_test.shape)
X train shape: (873076, 156385) X test shape: (218270, 156385) Y train shape: (873076,) Y test shape: (218270,)
Now it is possible to fit the logistic regression on the training data to build the model.
lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)
predictions = lr.predict(x_test)
In order to evaluate how well the model predicts the data, it is possible to compute a classification report covering precision, recall and f1-score. The following steps show how to compute the confusion matrix of the model and derive accuracy measures from it.
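To make the report easier to read, here is a small toy example (made-up labels) showing how precision and recall derive from the confusion matrix; with sklearn's (y_true, y_pred) convention, rows are true classes and columns are predicted classes:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true_toy = [1, 1, 1, 1, -1, -1]
y_pred_toy = [1, 1, 1, -1, -1, 1]

# rows follow the sorted labels [-1, 1]: row = true class, column = predicted class
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)  # [[1 1]
           #  [1 3]]

# for the positive class: precision = TP/(TP+FP) = 3/4, recall = TP/(TP+FN) = 3/4
print(precision_score(y_true_toy, y_pred_toy), recall_score(y_true_toy, y_pred_toy))
```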
conf_matrix = confusion_matrix(predictions, y_test)
conf_matrix
array([[ 5246, 1705],
[ 12553, 198766]], dtype=int64)
print(classification_report(predictions,y_test))
countvect_lr_acc = round(classification_report(predictions, y_test, output_dict=True)['accuracy'], 2) * 100
precision recall f1-score support
-1 0.29 0.75 0.42 6951
1 0.99 0.94 0.97 211319
accuracy 0.93 218270
macro avg 0.64 0.85 0.69 218270
weighted avg 0.97 0.93 0.95 218270
The overall accuracy of the model on the test data is around 93%, which is pretty good. But, as the classification report shows, the precision, recall and f1-score for the negative reviews are very low. This is probably due to an unbalanced training set, which is also suggested by the precision for the positive reviews being close to 100%. It is therefore worth balancing the training dataset.
From the plots in the above section and the classification results it is clear that the data are unbalanced, so it is possible to take a subset of the dataset in order to have the same amount of data for both classes.
positive_reviews = clean_review_df[clean_review_df['sentiment'] == 1]
negative_reviews = clean_review_df[clean_review_df['sentiment'] == -1]
positive_sample = positive_reviews.sample(negative_reviews.shape[0])  # downsample without replacement to avoid duplicates
reviews_sample_df = pd.concat([positive_sample, negative_reviews], axis=0)
print('Number of samples selected:', reviews_sample_df.shape[0])
Number of samples selected: 175568
From the following plot it is possible to see that the training dataset is now balanced: it contains the same amount of positive and negative reviews.
fig = px.histogram(reviews_sample_df, x='sentiment', color='sentiment')
fig.update_layout(title_text='Food Review Sentiment')
fig.show()
def get_logistic_regression_acc(x_train, y_train, x_test, y_test):
lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)
predicted = lr.predict(x_test)
return round(metrics.accuracy_score(predicted, y_test),2)*100
def get_multinomial_naive_bayes_acc(x_train, y_train, x_test, y_test):
mn_nb = MultinomialNB()
mn_nb.fit(x_train, y_train)
predicted = mn_nb.predict(x_test)
return round(metrics.accuracy_score(predicted, y_test),2)*100
def get_linear_svm_acc(x_train, y_train, x_test, y_test):
svm = LinearSVC(C=0.01)
svm.fit(x_train, y_train)
predicted = svm.predict(x_test)
return round(metrics.accuracy_score(predicted, y_test),2)*100
After balancing the dataset it is possible to rebuild the same Logistic Regression model presented before.
train, test = train_test_split(reviews_sample_df, test_size=0.2)
print('Size of train data', train.shape)
print('Size of test data', test.shape)
Size of train data (140454, 2) Size of test data (35114, 2)
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train['review'])
test_matrix = vectorizer.transform(test['review'])
x_train = train_matrix
x_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']
print('X train shape:', x_train.shape)
print('X test shape:', x_test.shape)
print('Y train shape:', y_train.shape)
print('Y test shape:', y_test.shape)
X train shape: (140454, 57771) X test shape: (35114, 57771) Y train shape: (140454,) Y test shape: (35114,)
lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)
predictions = lr.predict(x_test)
From the results it is possible to see that the accuracy of the model has decreased to 76%. This is expected with balanced data: the model can no longer obtain high accuracy simply by favouring the majority class.
conf_matrix = confusion_matrix(predictions, y_test)
print(classification_report(predictions,y_test))
balanced_countvect_lr_acc = round(classification_report(predictions, y_test, output_dict=True)['accuracy'], 2) * 100
precision recall f1-score support
-1 0.75 0.76 0.75 17282
1 0.76 0.75 0.76 17832
accuracy 0.76 35114
macro avg 0.76 0.76 0.76 35114
weighted avg 0.76 0.76 0.76 35114
It is now possible to run other classification algorithms on the balanced data and compare them with the Logistic Regression model. The first is a Naive Bayes classifier: a classification algorithm that relies on Bayes' theorem. This theorem provides a way of calculating a type of probability called the posterior probability, in which the probability of an event A occurring depends on known probabilistic evidence (e.g. the occurrence of an event B).
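A small worked example of Bayes' theorem, with made-up probabilities for illustration: suppose 90% of reviews are positive, but the word "bland" appears in 2% of positive reviews and 30% of negative ones.

```python
p_pos = 0.9               # prior P(positive)
p_neg = 0.1               # prior P(negative)
p_bland_given_pos = 0.02  # likelihood P("bland" | positive)
p_bland_given_neg = 0.30  # likelihood P("bland" | negative)

# total probability of observing "bland"
p_bland = p_bland_given_pos * p_pos + p_bland_given_neg * p_neg

# Bayes' theorem: P(positive | "bland") = P("bland" | positive) * P(positive) / P("bland")
posterior = p_bland_given_pos * p_pos / p_bland
print(round(posterior, 3))  # 0.375: the evidence pushes the 90% prior below 50%
```

Multinomial Naive Bayes applies the same rule, "naively" multiplying the likelihoods of all the words in a review as if they were independent.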
balanced_countvect_nb_acc = get_multinomial_naive_bayes_acc(x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test)
print('Accuracy of Multinomial Naive Bayes is:', balanced_countvect_nb_acc, '%')
Accuracy of Multinomial Naive Bayes is: 75.0 %
The accuracy of Multinomial Naive Bayes is 75%, very close to that of the Logistic Regression.
Again, it is possible to try another classification algorithm: Support Vector Machines. SVMs are designed to solve classification and regression problems and can easily handle multiple continuous and categorical variables. An SVM constructs a hyperplane in multidimensional space to separate the different classes; the hyperplane is generated in an iterative manner so as to minimize an error. The core idea of SVM is to find the maximum marginal hyperplane (MMH) that best divides the dataset into classes.
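As a minimal sketch of the separating-hyperplane idea (toy 2D points, clearly separable, not from the dataset):

```python
from sklearn.svm import LinearSVC

# two well-separated clusters labelled -1 and +1
X_toy = [[0, 0], [1, 0], [0, 1], [4, 4], [5, 4], [4, 5]]
y_toy = [-1, -1, -1, 1, 1, 1]

svm_toy = LinearSVC(C=1.0)
svm_toy.fit(X_toy, y_toy)

# points near each cluster fall on the corresponding side of the hyperplane
print(svm_toy.predict([[0.5, 0.5], [4.5, 4.5]]))
```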
balanced_countvect_svm_acc = get_linear_svm_acc(x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test)
print('Accuracy of SVM is:', balanced_countvect_svm_acc, '%')
Accuracy of SVM is: 76.0 % D:\Projects\Python\food.com-sentiment-analysis\venv\lib\site-packages\sklearn\svm\_base.py:985: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
The accuracy of SVM is 76%, very close to the Logistic Regression and the Naive Bayes classifier.
It is possible to try using TF-IDF instead of CountVectorizer to embed the words and re-train the algorithms presented in the sections above.
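The key difference is that TF-IDF down-weights words that appear in many documents. A toy corpus (illustrative sentences) makes this visible through the learned idf_ weights:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "recipe" appears in every document, so its idf weight is the lowest
toy_corpus = ['great recipe', 'bland recipe', 'easy recipe']
toy_tfidf = TfidfVectorizer()
toy_tfidf.fit(toy_corpus)

# with sklearn's smoothed idf, "recipe" gets 1.0 and the rarer words ~1.693
for word, idx in sorted(toy_tfidf.vocabulary_.items()):
    print(word, round(float(toy_tfidf.idf_[idx]), 3))
```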
tfidf_vectorizer = TfidfVectorizer()
train_matrix = tfidf_vectorizer.fit_transform(train['review'])
test_matrix = tfidf_vectorizer.transform(test['review'])
x_train = train_matrix
x_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']
balanced_tfidf_ls_acc = get_logistic_regression_acc(x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test)
print('Accuracy of Logistic Regression is:', balanced_tfidf_ls_acc, '%')
Accuracy of Logistic Regression is: 76.0 %
balanced_tfidf_nb_acc = get_multinomial_naive_bayes_acc(x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test)
print('Accuracy of Multinomial Naive Bayes is:', balanced_tfidf_nb_acc, '%')
Accuracy of Multinomial Naive Bayes is: 75.0 %
balanced_tfidf_svm_acc = get_linear_svm_acc(x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test)
print('Accuracy of SVM is:', balanced_tfidf_svm_acc, '%')
Accuracy of SVM is: 76.0 %
From the above results it is possible to see that the accuracy using TF-IDF is neither better nor worse than the accuracy obtained with CountVectorizer.
In order to improve the accuracy of the models it is possible to clean the dataset in a more sophisticated way. The data will be normalized using two methods: stemming and lemmatization. Normalization is a common step in text preprocessing: with this technique, different forms of a word are converted into one. Stemming is considered the cruder, brute-force approach to normalization; these algorithms use basic rules to chop off the ends of words. Lemmatization works by identifying the part of speech of a given word and then applying more complex rules to transform the word into its true root.
nltk.download('wordnet')
[nltk_data] Downloading package wordnet to [nltk_data] C:\Users\Clark\AppData\Roaming\nltk_data... [nltk_data] Package wordnet is already up-to-date!
True
def get_stemmed_text(text):
    stemmer = PorterStemmer()
    return ' '.join(stemmer.stem(word) for word in text.split())
def get_lemmatized_text(text):
    lemmatizer = WordNetLemmatizer()
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())
reviews_sample_df['review'] = reviews_sample_df['review'].apply(get_stemmed_text)
reviews_sample_df['review'] = reviews_sample_df['review'].apply(get_lemmatized_text)
reviews_sample_df.head()
| review | sentiment | |
|---|---|---|
| 520022 | list new breadmak fri think fri egg sandwich g... | 1 |
| 1068352 | good made babi shower big hit ad coconut run l... | 1 |
| 465301 | great altern tradit method usual use recip nic... | 1 |
| 69002 | reali easi sooo tasti dabbl recip bit doubl am... | 1 |
| 237916 | delciou crispi moist gravi da bomb love tyme c... | 1 |
In order to test whether normalization improves the classification, it is possible to rebuild the previous algorithms on the normalized dataset, using one of the two bag-of-words models presented above.
train, test = train_test_split(reviews_sample_df, test_size=0.2)
print('Size of train data', train.shape)
print('Size of test data', test.shape)
Size of train data (140454, 2) Size of test data (35114, 2)
tfidf_vectorizer = TfidfVectorizer()
train_matrix = tfidf_vectorizer.fit_transform(train['review'])
test_matrix = tfidf_vectorizer.transform(test['review'])
x_train = train_matrix
x_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']
print('X train shape:', x_train.shape)
print('X test shape:', x_test.shape)
print('Y train shape:', y_train.shape)
print('Y test shape:', y_test.shape)
X train shape: (140454, 44898) X test shape: (35114, 44898) Y train shape: (140454,) Y test shape: (35114,)
normalized_tfidf_lr_acc = get_logistic_regression_acc(x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test)
print('Accuracy of Logistic Regression is:', normalized_tfidf_lr_acc, '%')
Accuracy of Logistic Regression is: 76.0 %
normalized_tfidf_nb_acc = get_multinomial_naive_bayes_acc(x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test)
print('Accuracy of Multinomial Naive Bayes is:', normalized_tfidf_nb_acc, '%')
Accuracy of Multinomial Naive Bayes is: 73.0 %
normalized_tfidf_svm_acc = get_linear_svm_acc(x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test)
print('Accuracy of SVM is:', normalized_tfidf_svm_acc, '%')
Accuracy of SVM is: 76.0 %
As the above results show, the accuracies are very close (some are identical) to those obtained in the previous section of the project.
N-grams are sequences of two or three words (bigrams or trigrams). With the implementation of N-grams (n-word associations) the model can potentially be more predictive. For example, if a review contained the three-word sequence "didn't love recipe", a unigram-only model would consider these words individually and probably fail to capture that this is actually a negative sentiment, because the word "love" by itself is highly correlated with a positive review.
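As a small sketch (illustrative phrase), adding ngram_range=(1, 2) keeps word pairs in the vocabulary, so a negation such as "not love" survives as a feature:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vectorizer = CountVectorizer(ngram_range=(1, 2))
toy_vectorizer.fit(['not love recipe'])

# the vocabulary now contains the unigrams plus the bigrams joined by a space
print(sorted(toy_vectorizer.vocabulary_))
# ['love', 'love recipe', 'not', 'not love', 'recipe']
```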
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
train_matrix = tfidf_vectorizer.fit_transform(train['review'])
test_matrix = tfidf_vectorizer.transform(test['review'])
x_train = train_matrix
x_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']
print('X train shape:', x_train.shape)
print('X test shape:', x_test.shape)
print('Y train shape:', y_train.shape)
print('Y test shape:', y_test.shape)
X train shape: (140454, 947570) X test shape: (35114, 947570) Y train shape: (140454,) Y test shape: (35114,)
normalized_ngram_tfidf_lr_acc = get_logistic_regression_acc(x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test)
print('Accuracy of Logistic Regression is:', normalized_ngram_tfidf_lr_acc, '%')
Accuracy of Logistic Regression is: 77.0 %
normalized_ngram_tfidf_nb_acc = get_multinomial_naive_bayes_acc(x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test)
print('Accuracy of Multinomial Naive Bayes is:', normalized_ngram_tfidf_nb_acc, '%')
Accuracy of Multinomial Naive Bayes is: 75.0 %
normalized_ngram_tfidf_svm_acc = get_linear_svm_acc(x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test)
print('Accuracy of SVM is:', normalized_ngram_tfidf_svm_acc, '%')
Accuracy of SVM is: 76.0 %
As it is possible to see from the above results, the accuracy of the Multinomial Naive Bayes classifier is slightly better (75% instead of 73%), the Logistic Regression accuracy has also increased (77% instead of 76%), and the accuracy of the remaining algorithm (Linear Support Vector Machine) is the same.
In order to compare with the previous models it is possible to compute a sentiment analysis with a standalone sentiment analysis tool. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. VADER uses a sentiment lexicon: a list of lexical features (e.g. words) that are generally labelled according to their semantic orientation as either positive or negative. VADER has been found to be quite successful when dealing with social media texts, NY Times editorials, movie reviews, and product reviews. This is because VADER not only gives positivity and negativity scores but also tells us how positive or negative a sentiment is.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()
def sentiment_analyzer_scores(sentence):
score = analyser.polarity_scores(sentence)
print("{:-<4} {}".format(sentence, str(score)))
return score['compound']
sentiment_analyzer_scores('This recipe is super easy and tasty')
This recipe is super easy and tasty {'neg': 0.0, 'neu': 0.424, 'pos': 0.576, 'compound': 0.7783}
0.7783
From the above cell it is possible to see how the VADER sentiment analysis is used. Given a sentence, it returns a dictionary of negative, neutral, positive and compound values. The positive, negative and neutral scores represent the proportions of text that fall into these categories. For example, the sentence above was rated as 57% positive, 42% neutral and 0% negative, which sums up to a positive classification. The compound score is a metric that sums all the lexicon ratings and normalizes them between -1 (most extreme negative) and +1 (most extreme positive). In the case above, the compound score turns out to be 0.77, denoting a very positive sentiment.
It is now possible to apply the VADER sentiment analysis to the normalized, cleaned and balanced data in order to test its accuracy.
%%capture
reviews_sample_df['vader sentiment'] = reviews_sample_df.apply(lambda x: sentiment_analyzer_scores(x['review']), axis=1)
reviews_sample_df.head()
| review | sentiment | vader sentiment | |
|---|---|---|---|
| 520022 | list new breadmak fri think fri egg sandwich g... | 1 | 0.7650 |
| 1068352 | good made babi shower big hit ad coconut run l... | 1 | 0.8807 |
| 465301 | great altern tradit method usual use recip nic... | 1 | 0.8555 |
| 69002 | reali easi sooo tasti dabbl recip bit doubl am... | 1 | 0.8957 |
| 237916 | delciou crispi moist gravi da bomb love tyme c... | 1 | 0.9217 |
Now all the reviews have a computed VADER sentiment. In order to test the accuracy and precision of VADER, it is necessary to classify the VADER sentiment into +1 (positive review) or -1 (negative review): reviews with a compound score greater than 0 are classified as positive, while reviews with a score less than or equal to 0 are classified as negative.
reviews_sample_df['vader classification'] = reviews_sample_df['vader sentiment'].apply(lambda x: 1 if x > 0 else -1)
reviews_sample_df.head()
| review | sentiment | vader sentiment | vader classification | |
|---|---|---|---|---|
| 520022 | list new breadmak fri think fri egg sandwich g... | 1 | 0.7650 | 1 |
| 1068352 | good made babi shower big hit ad coconut run l... | 1 | 0.8807 | 1 |
| 465301 | great altern tradit method usual use recip nic... | 1 | 0.8555 | 1 |
| 69002 | reali easi sooo tasti dabbl recip bit doubl am... | 1 | 0.8957 | 1 |
| 237916 | delciou crispi moist gravi da bomb love tyme c... | 1 | 0.9217 | 1 |
vader_accuracy = round(classification_report(reviews_sample_df['vader classification'], reviews_sample_df['sentiment'], output_dict=True)['accuracy'], 2) * 100
print(classification_report(reviews_sample_df['vader classification'], reviews_sample_df['sentiment']))
precision recall f1-score support
-1 0.26 0.77 0.38 29230
1 0.92 0.55 0.69 146338
accuracy 0.59 175568
macro avg 0.59 0.66 0.54 175568
weighted avg 0.81 0.59 0.64 175568
In order to evaluate the models presented before, it is possible to compare them with a state-of-the-art model presented in the following paper:
The paper presents "FastText": a library for efficient learning of word representations and sentence classification. Its goal is to provide word embeddings and text classification efficiently. According to its authors, it is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation.
The core of FastText relies on the Continuous Bag of Words (CBOW) model for word representation and a hierarchical classifier to speed up training. CBOW is a shallow neural network that is trained to predict a word from its neighbors. FastText replaces the objective of predicting a word with predicting a category. These single-layer models train incredibly fast and can scale very well. Also, fastText replaces the softmax over labels with a hierarchical softmax. Here each node represents a label. This reduces computation as we don’t need to compute all labels probabilities. The limited number of parameters reduces training time.
fasttext_train_path = '../exported_datasets/fasttext/fasttext_train.txt'
fasttext_test_path = '../exported_datasets/fasttext/fasttext_test.txt'
reviews_sample_df.head()
| review | sentiment | vader sentiment | vader classification | |
|---|---|---|---|---|
| 520022 | list new breadmak fri think fri egg sandwich g... | 1 | 0.7650 | 1 |
| 1068352 | good made babi shower big hit ad coconut run l... | 1 | 0.8807 | 1 |
| 465301 | great altern tradit method usual use recip nic... | 1 | 0.8555 | 1 |
| 69002 | reali easi sooo tasti dabbl recip bit doubl am... | 1 | 0.8957 | 1 |
| 237916 | delciou crispi moist gravi da bomb love tyme c... | 1 | 0.9217 | 1 |
FastText needs labelled data to train the supervised classifier. Labels must start with the prefix "__label__", which is how it distinguishes a label from a word, followed here by "NEGATIVE" or "POSITIVE" depending on how the review is classified. Hence, it is possible to create a new column in the dataset with the reviews written in that format.
def get_review_line(sentence, sentiment):
if sentiment == 1:
return f'__label__POSITIVE {sentence}'
else:
return f'__label__NEGATIVE {sentence}'
reviews_sample_df['labelled'] = reviews_sample_df.apply(lambda x: get_review_line(x['review'], x['sentiment']), axis=1)
reviews_sample_df.head()
| | review | sentiment | vader sentiment | vader classification | labelled |
|---|---|---|---|---|---|
| 520022 | list new breadmak fri think fri egg sandwich g... | 1 | 0.7650 | 1 | __label__POSITIVE list new breadmak fri think ... |
| 1068352 | good made babi shower big hit ad coconut run l... | 1 | 0.8807 | 1 | __label__POSITIVE good made babi shower big hi... |
| 465301 | great altern tradit method usual use recip nic... | 1 | 0.8555 | 1 | __label__POSITIVE great altern tradit method u... |
| 69002 | reali easi sooo tasti dabbl recip bit doubl am... | 1 | 0.8957 | 1 | __label__POSITIVE reali easi sooo tasti dabbl ... |
| 237916 | delciou crispi moist gravi da bomb love tyme c... | 1 | 0.9217 | 1 | __label__POSITIVE delciou crispi moist gravi d... |
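As a quick sanity check, the formatting helper can be exercised on a couple of made-up tokenized reviews to confirm it produces the prefix FastText expects:

```python
# Same helper as above, shown standalone with two toy inputs.
def get_review_line(sentence, sentiment):
    if sentiment == 1:
        return f'__label__POSITIVE {sentence}'
    return f'__label__NEGATIVE {sentence}'

print(get_review_line('great recip love', 1))  # __label__POSITIVE great recip love
print(get_review_line('too salti for us', 0))  # __label__NEGATIVE too salti for us
```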
FastText also needs two files: a training file with all the labelled training reviews and a test file with all the labelled test reviews. Hence, it is possible to split the dataset into train and test with the same proportions as the previous experiments (train 80%, test 20%) and save them into two ".txt" files.
fasttext_train, fasttext_test = train_test_split(reviews_sample_df, test_size=0.2)
print('Shape of training dataset:', fasttext_train.shape)
print('Shape of test dataset', fasttext_test.shape)
Shape of training dataset: (140454, 5) Shape of test dataset (35114, 5)
with open(fasttext_train_path, 'w') as train_file:
reviews = fasttext_train['labelled'].tolist()
for review in reviews:
train_file.write(review + '\n')
with open(fasttext_test_path, 'w') as test_file:
reviews = fasttext_test['labelled'].tolist()
for review in reviews:
test_file.write(review + '\n')
FastText can be trained with its predefined function "train_supervised". For this experiment it is possible to focus on the following arguments of the method: the learning rate ("lr"), the number of epochs ("epoch"), the size of the word N-Grams ("wordNgrams") and the dimension of the word vectors ("dim").
First of all, since in the previous experiments the models were trained without N-Grams, FastText is first run without N-Grams so the results can be compared. Then a bi-gram setting (2-grams) is added to check whether the model improves.
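FastText generates these N-Gram features internally when "wordNgrams" is greater than 1; the small helper below (an illustration only, not part of the library) shows what additional features a bigram model sees compared to unigrams:

```python
# Word-level n-grams of a tokenized review, mirroring wordNgrams=2 in FastText.
def word_ngrams(tokens, n):
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'great altern tradit method'.split()
print(word_ngrams(tokens, 1))  # ['great', 'altern', 'tradit', 'method']
print(word_ngrams(tokens, 2))  # ['great altern', 'altern tradit', 'tradit method']
```

The bigram features preserve some local word order ("great altern" differs from "altern great"), which unigram bag-of-words models discard.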
hyper_params = {"lr": 0.01,
"epoch": 20,
"wordNgrams": 1,
"dim": 20}
model = fasttext.train_supervised(input=fasttext_train_path, **hyper_params)
print("Model trained with the hyperparameter \n {}".format(hyper_params))
Model trained with the hyperparameter
{'lr': 0.01, 'epoch': 20, 'wordNgrams': 1, 'dim': 20}
hyper_params_ngrams = {"lr": 0.01,
"epoch": 20,
"wordNgrams": 2,
"dim": 20}
model_ngrams = fasttext.train_supervised(input=fasttext_train_path, **hyper_params_ngrams)
print("Model trained with the hyperparameter \n {}".format(hyper_params_ngrams))
Model trained with the hyperparameter
{'lr': 0.01, 'epoch': 20, 'wordNgrams': 2, 'dim': 20}
The following code evaluates the two FastText models: the one without N-Grams reaches an accuracy of 76%, while the one with 2-Grams reaches 77%.
test = model.test(fasttext_test_path)
fasttext_accuracy = round(test[1], 2) * 100
print('Accuracy of FastText without N-Grams in testing data is:', fasttext_accuracy, '%')
Accuracy of FastText without N-Grams in testing data is: 76.0 %
test_ngrams = model_ngrams.test(fasttext_test_path)
fasttext_accuracy_ngrams = round(test_ngrams[1], 2) * 100
print('Accuracy of FastText with N-Grams in testing data is:', fasttext_accuracy_ngrams, '%')
Accuracy of FastText with N-Grams in testing data is: 77.0 %
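For clarity on the "test[1]" indexing used above: "model.test" returns a 3-tuple of (number of examples, precision at 1, recall at 1), and since each review carries exactly one label, precision and recall coincide and can be read as plain accuracy. A small sketch with made-up numbers:

```python
# model.test returns (n_examples, precision@1, recall@1); with one label per
# review, precision@1 equals recall@1 equals accuracy.
def accuracy_from_test(result):
    n_examples, precision_at_1, recall_at_1 = result
    return round(precision_at_1 * 100, 1)

print(accuracy_from_test((35114, 0.76, 0.76)))  # 76.0
```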
In order to have a clear evaluation of all the models and results, it is possible to compare them in a single table. The best results so far are obtained on the normalized dataset with TF-IDF and N-Grams using Logistic Regression (77%), but it must be said that the results of all the models are very close to each other. VADER is the only model that produced poor results (59%); this is probably due to the normalization, since that library works best on raw data where punctuation and capitalization are preserved. It is also worth noting that the state-of-the-art "FastText" model produced results very similar to those of all the other models.
The first Logistic Regression applied, with an accuracy of 93%, is the one trained on the unbalanced dataset. From its precision and recall it is possible to state that the accuracy of this model is heavily influenced by the classification of positive reviews, since they made up almost 80% of the entire dataset.
In conclusion, it is possible to state that despite the application of various models to differently balanced datasets, the accuracy on this dataset stays around 76%.
Furthermore, the analysis need not end here: future developments could investigate whether the number of words, the number of ingredients, or the recipe instructions correlate with the user rating.
accuracy_index = ['Baseline with CountVectorizer', 'Balanced with CountVectorizer', 'Balanced with TF-IDF', 'Normalized with TF-IDF', 'Normalized with TF-IDF and N-Grams']
accuracy_data = {'Logistic Regression': [countvect_lr_acc, balanced_countvect_lr_acc, balanced_tfidf_ls_acc, normalized_tfidf_lr_acc, normalized_ngram_tfidf_lr_acc],
'Multinomial Naive Bayes': ['-', balanced_countvect_nb_acc, balanced_tfidf_nb_acc, normalized_tfidf_nb_acc, normalized_ngram_tfidf_nb_acc],
'Linear SVM': ['-', balanced_countvect_svm_acc, balanced_tfidf_svm_acc, normalized_tfidf_svm_acc, normalized_ngram_tfidf_svm_acc],
'VADER': ['-', '-', '-', vader_accuracy, '-'],
'FastText': ['-', '-', '-', fasttext_accuracy, fasttext_accuracy_ngrams]}
accuracy_results_df = pd.DataFrame(accuracy_data, index=accuracy_index)
accuracy_results_df
| | Logistic Regression | Multinomial Naive Bayes | Linear SVM | VADER | FastText |
|---|---|---|---|---|---|
| Baseline with CountVectorizer | 93.0 | - | - | - | - |
| Balanced with CountVectorizer | 76.0 | 75.0 | 76.0 | - | - |
| Balanced with TF-IDF | 76.0 | 75.0 | 76.0 | - | - |
| Normalized with TF-IDF | 76.0 | 73.0 | 76.0 | 59.0 | 76.0 |
| Normalized with TF-IDF and N-Grams | 77.0 | 75.0 | 76.0 | - | 77.0 |